
Add support for advanced.config #221


Merged: 7 commits merged into master from the advancedCon branch on Jul 29, 2020

Conversation

@ChunyiLyu (Contributor) commented on Jul 22, 2020:

This closes #219

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

  • Configured through `spec.rabbitmq.advancedConfig` as a string (see the example sketch after this list).
  • Updating advancedConfig will trigger a StatefulSet rolling restart because, like `rabbitmq.conf`, the advanced config is loaded at node start.
  • Before restarting the StatefulSet, the controller runs rabbitmqctl eval <advanced.config> to validate that the input has correct syntax. rabbitmqctl eval evaluates the input, so it is more than a minimal syntax validation. Some configurations in advanced.config can be loaded while RabbitMQ is running, and some can only be loaded at node start. I looked into erl -noshell -eval and writing an escript, but could not figure out how to use them successfully, so I settled on using rabbitmqctl eval.
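
For illustration (not part of the original PR description), here is a minimal sketch of how the new field might be set on a RabbitmqCluster resource. Only `spec.rabbitmq.advancedConfig` itself is introduced by this PR; the apiVersion, the resource name, and the Erlang terms inside the string are assumptions for the example.

  apiVersion: rabbitmq.com/v1beta1      # assumed API version for the example
  kind: RabbitmqCluster
  metadata:
    name: example-rabbitmq              # illustrative name
  spec:
    rabbitmq:
      advancedConfig: |
        [
          {rabbit, [
            {auth_backends, [rabbit_auth_backend_ldap]}
          ]}
        ].

Because the string ends up in advanced.config, it has to be a valid Erlang term list terminated with a period, as in the hypothetical LDAP example above.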

Additional Context

Local Testing

I have run unit, integration, and system tests.

This change is covered at the unit and system test levels.

- related to issue #219
- configured through `spec.rabbitmq.advancedConfig` as a string
- updating advancedConfig will trigger a sts rolling restart bc similar to `rabbitmq.conf`, advanced config is loaded at node start
@@ -4476,6 +4472,10 @@ spec:
type: string
maxItems: 100
type: array
advancedConfig:
description: Specify any rabbitmq advanced.config configurations
maxLength: 2000
Contributor:

Is this limit required? 2000 characters can be exceeded in environments with extensive LDAP queries and OAuth 2 settings, for example.

@ChunyiLyu (author) replied:

It's not required. There is a hard limit in Kubernetes because all objects are stored in etcd, and etcd has a limit of 1MB per object, which translates to 1048576 characters. I will update the limit to 100000. Thanks for pointing it out, Michael!
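
For reference, a sketch of what the CRD property might look like after raising the limit; the indentation and the type line are reconstructed here rather than quoted from the diff.

  advancedConfig:
    description: Specify any rabbitmq advanced.config configurations
    maxLength: 100000
    type: string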

@Zerpet (Member) previously approved these changes on Jul 23, 2020 and left a comment:

I left a cosmetic suggestion, nothing that stops the ✅ 👌

Comment on lines 146 to 147
advancedConfig, ok := configMap.Data["advanced.config"]
Expect(ok).To(BeTrue())
@Zerpet (Member):

Cosmetic change request 🙃 Let's annotate the assertion so that if this ever fails, we get a message more descriptive than "expected false to be true"

@ChunyiLyu (author) replied:

Updated in the new commit ;) good call, Aitor.

@ChunyiLyu (author):

Working on adding syntax validation before updating advanced.config and restarting the StatefulSet.

ChunyiLyu and others added 2 commits July 27, 2020 20:28
* Add client service override `spec.override.clientService`

- this closes #142
- add an additional port example; using the client service override
- make sts and svc override's name and namespace not configurable
Pinning to a specific SHA no longer makes sense with no OSL process.
@Zerpet (Member) left a review comment:

In the "failure" scenario where an invalid advanced config is provided, I don't think we are loud enough to surface that there is a problem. I think that updating the status condition and logging an event is not enough, because the user has to decide to take the additional step of describing the RabbitmqCluster resource.

In the current implementation, I can't think of any motivation for the user to take a diagnosing step, given that all pods are running and "everything just works". It'll be at a later time, when RabbitMQ is not doing what's expected (due to advanced config not being loaded), that a user would suspect there is something not right. And even at this point, I don't think that describing the custom resource is a step one would think of taking as a first step.

I think this safeguard is protecting availability in a scenario where availability is not expected, i.e. a single RabbitMQ node. In a single-node scenario, a user was expecting to lose availability due to the node restart to apply the advanced config. I don't think that preventing the restart (when it was expected) and never reconciling again (due to #225) is a better alternative to letting RMQ crash due to invalid configuration.

In the scenario of a 3-node cluster, the StatefulSet controller will preserve the availability of the cluster because it won't roll the surviving 2 replicas until the first one is ready, which will never happen due to the invalid advanced config.

@Zerpet dismissed their stale review on July 28, 2020 13:07

The latest changes have modified the behaviour of the Operator and the original review is no longer valid.

@ChunyiLyu (author):

@Zerpet

You are totally right that the implementation here is far from ideal. This is part of the reason why I think it is important for us to figure out a more reliable way to load updated configurations, and why I created issue #225.

In my opinion, it seems very easy for anyone to make syntax mistakes with their advanced config, and a rabbitmq-server node crash is something we would like to avoid. It's true that users expect downtime when they update configurations because of the restart. However, a node crashing not only causes downtime but also panic, and it sounds like unnecessary pain for our users. I think that, given that users expect a restart of the StatefulSet when they update configurations, and they are most likely manually performing this action, the fact that the StatefulSet stays running and is not restarted is a signal to them that something has gone wrong. status.condition is also the recommended way to monitor k8s objects, and users should be actively monitoring that.

This all depends on how much we want to prevent a server crash. If a server crash is something users are totally ok with and are used to handling, removing the validation and letting it happen makes sense to me. However, if it is a failure mode that causes pain for users, I do not see why we wouldn't add an additional validation to avoid arriving there.

I would like to hear others' opinions on this subject as well. Thoughts? @mkuratczyk @j4mcs

@Zerpet (Member) commented on Jul 28, 2020:

> However, a node crashing not only causes downtime but also panic, and it sounds like unnecessary pain for our users. I think that, given that users expect a restart of the StatefulSet when they update configurations, and they are most likely manually performing this action, the fact that the StatefulSet stays running and is not restarted is a signal to them that something has gone wrong.

I think that a node crash, when a user was expecting downtime, does not cause panic or extra pain. The user was expecting downtime anyway. In my opinion, a Pod that enters a crash loop backoff is a more evident, direct, right-in-your-face symptom that there is a problem. A Pod that is just running, when you expected it to roll, doesn't seem so evident to me that there is a problem. Maybe the change did not require a Pod restart? Plugins do not restart Pods, so maybe this change doesn't either? Maybe the update has not been scheduled yet? It's not evident unless the user takes further actions.

> status.condition is also the recommended way to monitor k8s objects, and users should be actively monitoring that.

I've been wondering how true this statement is in the world outside. For us, it is very clear. I have not yet heard any mention of conditions, or of the CR status, in any feedback communication we have had so far. I've been wondering for a while whether status conditions aren't just something we convinced ourselves "is the way". Even when I'm troubleshooting a problem during development, I rarely check status conditions or status objects unless I have a good reason to; instead I find myself doing kubectl get all and kubectl logs. I'm not convinced that "you should monitor your status conditions" is a widely adopted or understood practice.

> This all depends on how much we want to prevent a server crash.

Phrasing my argument another way: what does a server crash cause? Downtime. A single RabbitMQ node was expecting downtime, so we are not avoiding an unexpected event. In a 3-node cluster, what does one Pod in crash loop backoff cause? Degraded availability, no downtime. Service continuity remains intact.

I definitely think that a Pod in crash loop backoff is a much better option than silently masking the error in a status condition.

@mkuratczyk (Collaborator):

I see good points on both sides and haven't made up my mind yet. However, aren't validation webhooks the right tool for this job? If we just rejected changes to the CR when advancedConfig is invalid, we would prevent downtime and be explicit about the problem.
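
Not something this PR implements, but to make the webhook suggestion concrete: a validating admission webhook could reject a RabbitmqCluster whose advancedConfig fails a syntax check before the object is persisted, so no restart is ever attempted. Below is a sketch of a plain Kubernetes ValidatingWebhookConfiguration; every name, namespace, path, and version in it is an assumption for illustration, and only the rabbitmq.com group and the rabbitmqclusters resource come from this project.

  apiVersion: admissionregistration.k8s.io/v1
  kind: ValidatingWebhookConfiguration
  metadata:
    name: rabbitmqcluster-validation            # illustrative name
  webhooks:
    - name: vrabbitmqcluster.example.com        # illustrative name
      admissionReviewVersions: ["v1"]
      sideEffects: None
      failurePolicy: Fail                       # reject the object if the webhook cannot decide
      clientConfig:
        service:
          name: operator-webhook-service        # illustrative; the operator's webhook server
          namespace: rabbitmq-system            # illustrative namespace
          path: /validate-rabbitmq-com-v1beta1-rabbitmqcluster
      rules:
        - apiGroups: ["rabbitmq.com"]
          apiVersions: ["v1beta1"]              # assumed CRD version
          operations: ["CREATE", "UPDATE"]
          resources: ["rabbitmqclusters"]

The webhook server behind that service would perform the same kind of syntax check discussed above (for example via rabbitmqctl eval or an Erlang term parser) and return an admission denial with a descriptive message, which is what makes the failure explicit to the user.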

@ChunyiLyu (author) commented on Jul 29, 2020:

@Zerpet @mkuratczyk validation webhooks would be a more ideal place for this. I've deleted the validation logic.

The branch history looks a bit messy at the moment. I will squash merge.

@j4mcs (Contributor) commented on Jul 29, 2020:

I'm a bit wary of spending too much time in K8s concerned with a RabbitMQ fault. Ultimately if the user misconfigures their RabbitMQ cluster, the failure mode will be determined by RabbitMQ. I feel like we're trying to patch behaviour of the underlying technology with reconciliation logic. If an invalid advanced config should result in a spectacular failure of the RabbitMQ cluster then that's the core's responsibility.

I agree that we've made a conscious opinionated decision that conditions are where we surface underlying faults. But I don't see an issue with that. Ultimately, we've taken a cue from Knative and it's definitely not an established standard.

I worry that unless we let misconfigurations like these run their course, we're going to have more and more edge cases bloating reconciliation.

All that to say I'm happy with the code as it is. I'd be up for exploring a validating webhook, but I think it needs to be opt-in, not least because it will mean loading Erlang to do the syntax checking.

@Zerpet (Member) commented on Jul 29, 2020:

I agree and resonate with James' comment. I'm happy with the current state and in favour of exploring validating webhooks.

@ChunyiLyu merged commit 7a6946b into master on Jul 29, 2020
@ChunyiLyu deleted the advancedCon branch on July 29, 2020 11:27
@ChunyiLyu (author):

In my opinion, the benefit our operator can provide is improving the experience of operating core RabbitMQ on k8s. The operator should make deploying and running RabbitMQ on k8s easier for users than rolling their own solution. If we were to say that experience improvements do not belong in the operator, then how are we offering anything different from the templating that the Bitnami chart has already been offering for two years? I don't know the answer to this question, nor do I have a clean definition of what counts as a reasonable experience improvement. However, I disagree with the statement that we should avoid patching RabbitMQ behavior with our operator.

The PR is merged. I'm fine with the proposed next step. We can chat about the responsibility boundary of the operator some other time. This will definitely be a topic that comes up repeatedly.

ChunyiLyu added a commit that referenced this pull request Aug 25, 2020
* Add support for advanced.config

- related to issue #219 
- configured through `spec.rabbitmq.advancedConfig` as a string
- updating advancedConfig will trigger a sts rolling restart bc similar to `rabbitmq.conf`, advanced config is loaded at node start
- set maxLength for advancedConfig to 100000